e=2 \
s=200 \
v=-v \
in='fastq/ERR1007381_1.fastq.gz fastq/ERR1007381_2.fastq.gz' \
scaffolds stats
The “name=ecoli” parameter sets the prefix of the output file names to “ecoli”, “j=4”
specifies the number of processors (threads) to be used, “k=25” specifies the length of the
k-mer (substring), “c=255” specifies the minimum mean k-mer coverage of a unitig (a
high-confidence contig), “e=2” specifies the minimum erosion k-mer coverage, “s=200”
specifies the minimum unitig size (bp) required for building contigs, “v=-v” enables
verbose output, and “in=” specifies the input FASTQ file names.
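The parameters described above can be collected into a single command line. The following is a minimal Python sketch, not the book's own script: it builds the argument list from the values quoted in the text and only prints the command rather than running it. The helper name build_abyss_command is hypothetical, and the sketch assumes the order of the key=value options does not matter.

```python
import shlex


def build_abyss_command(name, threads, k, reads):
    """Return an abyss-pe argument list using the values quoted in the text.

    build_abyss_command is a hypothetical helper used for illustration only.
    """
    return [
        "abyss-pe",
        f"name={name}",           # prefix of the output file names
        f"j={threads}",           # number of processors (threads)
        f"k={k}",                 # k-mer length
        "c=255",                  # minimum mean k-mer coverage
        "e=2",                    # minimum erosion k-mer coverage
        "s=200",                  # minimum size threshold (bp)
        "v=-v",                   # verbose output
        "in=" + " ".join(reads),  # both FASTQ files in a single in= argument
        "scaffolds",
        "stats",
    ]


cmd = build_abyss_command(
    "ecoli",
    4,
    25,
    ["fastq/ERR1007381_1.fastq.gz", "fastq/ERR1007381_2.fastq.gz"],
)

# shlex.join quotes the in= argument because it contains a space,
# so the printed line can be pasted into a shell as-is.
print(shlex.join(cmd))
```

Printing the command before running it makes it easy to check that both FASTQ files end up inside one quoted “in=” argument.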
When the execution of the above command on the Linux terminal is complete, the
assembly statistics will be displayed. Assembling a genome with the “abyss-pe” command is
performed in three stages: assembling contigs without paired-end information, aligning
the paired-end reads to the initial assembly, and finally merging contigs joined by
paired-end information. Multiple files are generated across these three stages. The
descriptions of these files are as follows:
• A file with “*.dot” extension is a graph file in Graphviz format in which graphs (nodes
and edges) are defined in the DOT language. We can visualize this file by drawing
the graphs with the Graphviz program, which can be installed on Linux using “sudo
apt install graphviz”. We can then draw the graphs and save them in PNG format using:
dot -Tpng ecoli-5.dot -o ecoli-5.png
• A file with “*.hist” extension is a histogram file made of two tab-separated columns:
the fragment size and count. It shows the distribution of the fragment sizes.
• A file with “*.path” extension is an ABySS graph path file, which describes how
sequences should be joined to form new sequences.
• A file with “*.sam.gz” extension is a compressed SAM file that is used by ABySS to
describe alignments of reads to the assembled sequences at different stages of the assembly.
• A file with “*.fa” extension is a FASTA file containing contig or scaffold sequences.
The definition line of each FASTA entry consists of three parts: <SEQ_ID> <SEQ_LEN>
<KMERS>, where SEQ_ID is a unique identifier for the sequence assigned by ABySS,
SEQ_LEN is the length of the sequence in bases, and KMERS is the number of
k-mers that mapped to the sequence in the assembly.
• A file with “*.fai” extension is an index file for the corresponding FASTA file.
It includes five tab-separated columns (name of the sequence,
length of the sequence, offset of the first base in the file, number of bases in each line,
and number of bytes in each line).
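The FASTA definition line and the five “*.fai” columns described above are simple whitespace- or tab-separated records. As a minimal sketch, they could be parsed in Python as follows. The class and function names are hypothetical, and the sample values are made up for illustration rather than taken from a real ABySS run.

```python
from typing import NamedTuple


class FastaHeader(NamedTuple):
    seq_id: str    # unique identifier assigned by ABySS
    seq_len: int   # sequence length in bases
    kmers: int     # number of k-mers mapped to the sequence


class FaiRecord(NamedTuple):
    name: str        # name of the sequence
    length: int      # length of the sequence
    offset: int      # byte offset of the first base in the FASTA file
    line_bases: int  # number of bases in each line
    line_bytes: int  # number of bytes in each line (including the newline)


def parse_definition_line(line: str) -> FastaHeader:
    """Parse '>SEQ_ID SEQ_LEN KMERS', ignoring any extra trailing fields."""
    fields = line.strip().lstrip(">").split()
    return FastaHeader(fields[0], int(fields[1]), int(fields[2]))


def parse_fai_line(line: str) -> FaiRecord:
    """Parse one tab-separated line of a .fai index file."""
    name, length, offset, line_bases, line_bytes = line.rstrip("\n").split("\t")
    return FaiRecord(name, int(length), int(offset), int(line_bases), int(line_bytes))


# Made-up sample records (not real assembly output).
header = parse_definition_line(">12 5000 123456")
record = parse_fai_line("12\t5000\t16\t5000\t5001\n")

print(header)
print(record)
```

In this made-up example the offset is 16 because the definition line, including its newline, occupies the first 16 bytes of the file.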